Analysis of Categorical Data


Introduction to categorical data

Many experiments result in measurements that are qualitative (categorical) rather than quantitative (numbers).

Example

$n$ manufactured items are sampled and categorized into "acceptable", "seconds", or "rejects"
$n$ employees are surveyed and classified into one of five income brackets.

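Each observation $X_i$ can be encoded as a vector of length $k$, with $X_{ij}=1$ if $X_i$ falls in category $j$ and $0$ otherwise, for $j=1,\dots,k$.

The sample $(X_1,\dots,X_n)$ can then be summarized by the vector of counts $N = (N_1,\dots,N_k)$, where $N_j = \sum_{i=1}^n X_{ij}$ is the number of observations in category $j$.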
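The encoding above can be sketched in a few lines of Python. The category names follow the manufacturing example, but the six labels are made up for illustration, and `one_hot` is a hypothetical helper name.

```python
# Hypothetical labels for n = 6 sampled items (illustrative data only).
categories = ["acceptable", "seconds", "rejects"]
labels = ["acceptable", "rejects", "acceptable", "seconds", "acceptable", "seconds"]

def one_hot(label, categories):
    """Return the indicator vector X_i with a 1 in the position of `label`."""
    return [1 if label == c else 0 for c in categories]

# X[i][j] = 1 if observation i falls in category j, and 0 otherwise.
X = [one_hot(label, categories) for label in labels]

# N_j = sum over i of X_ij: add the indicator vectors coordinate-wise.
N = [sum(column) for column in zip(*X)]

print(X)
print(N)  # [3, 2, 1]
```

Each row of `X` contains exactly one 1, so the counts in `N` always add up to $n$.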

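Definition (multinomial distribution) Consider a random experiment consisting of

$n$ independent and identical trials, each of which results in exactly one of
$k$ distinct categories.

Let $N_1,\dots,N_{k}$ be the numbers of the $n$ trials that fall in each category, and let $p_1,\dots,p_k$ be the category probabilities. Then $(N_1,\dots,N_k)$ follows the multinomial distribution ${\rm Mult}(n,p_1,\dots,p_k)$.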

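The probability mass function of ${\rm Mult}(n,p_1,\dots,p_k)$ is

$p(n_1,\dots,n_k) = \frac{n!}{n_1!n_2!\dots n_k!} p_1^{n_1}\dots p_k^{n_k}$

for nonnegative integers $n_1,\dots,n_k$ with $n_1+\dots+n_k=n$, where the probabilities sum to $1$: $p_1+\dots+p_k=1$.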

Remark 1: When $k=2$, the multinomial distribution reduces to the binomial distribution: ${\rm Bin}(n,p_1)={\rm Mult}(n,p_1,1-p_1)$.
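As a quick numerical check of the probability mass function and of Remark 1, here is a minimal sketch using only the standard library. The function name `multinomial_pmf` and the numbers plugged in are illustrative assumptions, not part of the notes.

```python
from math import comb, factorial, isclose, prod

def multinomial_pmf(counts, probs):
    """P(N_1 = n_1, ..., N_k = n_k) for N ~ Mult(n, p_1, ..., p_k)."""
    n = sum(counts)
    coefficient = factorial(n)
    for n_j in counts:
        coefficient //= factorial(n_j)  # the running quotient stays an integer
    return coefficient * prod(p ** n_j for p, n_j in zip(probs, counts))

# Three categories: 4!/(2! 1! 1!) * 0.5^2 * 0.3 * 0.2 = 12 * 0.015 = 0.18
three_cat = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])

# k = 2 should agree with the Bin(5, 0.3) probability of 2 successes.
two_cat = multinomial_pmf([2, 3], [0.3, 0.7])
binomial = comb(5, 2) * 0.3 ** 2 * 0.7 ** 3

print(three_cat, two_cat, binomial)
print(isclose(two_cat, binomial))
```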

Remark 2: If $X_1,\dots,X_n$ are independent with $X_{i}\sim {\rm Mult}(1,p_1,p_2,\dots,p_k)$, then $N = \sum_{i=1}^n X_i \sim {\rm Mult}(n,p_1,\dots,p_k)$.

Goal: given observed counts $(n_1,\dots,n_k)$ from $N \sim {\rm Mult}(n,p_1,\dots,p_k)$, make inference about the probabilities $p_1, p_2,\dots, p_k$.

Chi-Square Tests

Setting

Let $(X_1,\dots,X_n)$ be a random sample with $X_i \sim {\rm Mult}(1,p_1,\dots,p_k)$, and let $N = (N_{1},\dots,N_{k})$ be the vector of category counts, so that $N \sim {\rm Mult}(n, p_1,\dots,p_k)$.

Null and alternative hypothesis:

$H_0: p_1 = p_{10},\dots,p_k = p_{k0}$
$H_1:$ $p_i \ne p_{i0}$ for at least one $i=1,\dots,k$,
where $p_{10},\dots,p_{k0}$ are specified probabilities with $\sum_{i=1}^k p_{i0} = 1$.

Test statistic:

$Q =\sum_{i=1}^k \frac{{\rm (Observed\,Count-Expected\,Count)^2}}{{\rm Expected\,Count}}= \sum_{i=1}^k \frac{(N_i - n p_{i0})^2}{n p_{i0}}$

Null distribution:
$Q \underset{H_0}{\dot\sim} \chi^2(k-1)$

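The level-$\alpha$ rejection region in terms of the observed statistic $q_{obs}$ is $\{q_{obs}; q_{obs} \ge \chi^2_{\alpha}(k-1)\}$, where $q_{obs} =\sum_{i=1}^k \frac{(n_i - n p_{i0})^2}{n p_{i0}}$ and $\chi^2_{\alpha}(k-1)$ is the upper-$\alpha$ quantile of the $\chi^2(k-1)$ distribution.
We reject $H_0$ if $q_{obs}$ is at least $\chi^2_\alpha(k-1)$; this gives an approximate level-$\alpha$ test.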
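The test can be sketched with the standard library alone. The counts and null probabilities below are hypothetical, and `chi2_sf` is an illustrative helper that evaluates the upper-tail probability of a $\chi^2$ distribution in closed form for integer degrees of freedom. Rejecting when the p-value is at most $\alpha$ is equivalent to rejecting when $q_{obs} \ge \chi^2_\alpha(k-1)$.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(q, df):
    """Upper-tail probability P(chi2(df) >= q) for a positive integer df."""
    x = q / 2.0
    if df % 2 == 0:
        # Even df: exp(-x) * sum_{j=0}^{df/2 - 1} x^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= x / j
            total += term
        return exp(-x) * total
    # Odd df: erfc(sqrt(x)) + exp(-x) * sum_{j=1}^{(df-1)/2} x^(j-1/2) / Gamma(j+1/2)
    total = sum(x ** (j - 0.5) / gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(x)) + exp(-x) * total

def pearson_chi_square(counts, null_probs):
    """Q = sum over categories of (observed - expected)^2 / expected."""
    n = sum(counts)
    return sum((n_i - n * p) ** 2 / (n * p) for n_i, p in zip(counts, null_probs))

# Hypothetical data: n = 100 observations in k = 3 categories.
counts = [55, 25, 20]
null_probs = [0.5, 0.3, 0.2]
alpha = 0.05

q_obs = pearson_chi_square(counts, null_probs)
df = len(counts) - 1
p_value = chi2_sf(q_obs, df)
reject = p_value <= alpha

print(q_obs, df, p_value, reject)
```

With these made-up counts the statistic is small relative to the $\chi^2(2)$ reference distribution, so $H_0$ is not rejected at the 5% level.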

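Remark 1: The test relies on a large-$n$ approximation. A common rule of thumb for the $\chi^2$ approximation to be adequate is $n p_{i}\ge 5$ for all $i=1,\dots,k$, so check whether every expected count is at least $5$ or not.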

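Remark 2: It can be shown that $-2\log \lambda\approx Q$ for large $n$, where $\lambda$ is the likelihood ratio statistic. The number of degrees of freedom of the limiting $\chi^2$ distribution is the difference of the number of free parameters in the null and full parameter spaces.

Number of free parameters under the null: $0$
Number of free parameters in the full parameter space: $k-1$.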
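To see how close the two statistics can be, the sketch below computes both on hypothetical counts. It uses the standard multinomial likelihood ratio form $-2\log\lambda = 2\sum_i n_i \log\{n_i/(n p_{i0})\}$, which follows from plugging the unrestricted estimates $\hat p_i = n_i/n$ into the likelihood. The function names and data are illustrative assumptions.

```python
from math import log

def pearson_q(counts, null_probs):
    """Pearson statistic Q."""
    n = sum(counts)
    return sum((n_i - n * p) ** 2 / (n * p) for n_i, p in zip(counts, null_probs))

def minus_two_log_lambda(counts, null_probs):
    """Likelihood ratio statistic; empty cells contribute 0 (0 * log 0 := 0)."""
    n = sum(counts)
    return 2.0 * sum(
        n_i * log(n_i / (n * p)) for n_i, p in zip(counts, null_probs) if n_i > 0
    )

# Hypothetical data: n = 100 observations in k = 3 categories.
counts = [48, 35, 17]
null_probs = [0.5, 0.3, 0.2]

q = pearson_q(counts, null_probs)
g = minus_two_log_lambda(counts, null_probs)
print(q, g)  # the two values are close (about 1.36 and 1.35)
```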

Example: A sample of size $n = 90$ yields observed counts $n_1 = 23, n_2 = 36$ and $n_3 = 31$ in $k=3$ categories.
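As a worked illustration with these counts, the following sketch assumes the null $p_{10}=p_{20}=p_{30}=1/3$ and level $\alpha=.05$; neither is specified in the example as given, so both are assumptions made here. For $2$ degrees of freedom the upper-tail probability of the $\chi^2$ distribution has the closed form $e^{-q/2}$.

```python
from math import exp

counts = [23, 36, 31]            # observed counts from the example, n = 90
null_probs = [1 / 3, 1 / 3, 1 / 3]  # ASSUMED null: equal category probabilities
alpha = 0.05                     # ASSUMED significance level

n = sum(counts)
expected = [n * p for p in null_probs]  # 30 in every category
q_obs = sum((o - e) ** 2 / e for o, e in zip(counts, expected))

df = len(counts) - 1             # k - 1 = 2
p_value = exp(-q_obs / 2.0)      # exact chi-square upper tail when df = 2
reject = p_value <= alpha

print(q_obs, p_value, reject)    # q_obs = 86/30, roughly 2.87
```

Under these assumptions every expected count is $30 \ge 5$, and $q_{obs}\approx 2.87$ is below the tabulated value $\chi^2_{.05}(2)\approx 5.99$, so $H_0$ would not be rejected.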

Chi-square Goodness of Fit (GoF) test

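The $\chi^2$ test can also be used to check whether a sample $(x_1,\dots,x_n)$ comes from a specified distribution $F_0$ or not.

Idea: partition the range of $X$ into $k$ regions (say $A_1,\dots,A_k$) and compare observed and expected counts for each region.

$H_0$: $X\sim F_0$ vs. $H_1: X \sim F\ne F_0$ $\rightarrow$
$H_0: (p_1,\dots,p_k) = (p_{10},\dots,p_{k0})$ vs. $H_1$: $p_i \ne p_{i0}$ for at least one $i$, where $p_{i0} = P_{X\sim F_0}(X \in A_i )$.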

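Example: Suppose $X$ counts successes in four trials, each with success probability $1/2$, so that under $H_0$ $X$ follows ${\rm Bin}(4, 1/2)$. The values $0, 1, 2, 3$ and $4$ were observed $7, 18, 40, 31$ and $4$ times, respectively. Test $H_0$ at level $\alpha=.05$.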
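The calculation for this example can be sketched as follows, taking the five possible values of $X$ as the regions $A_1,\dots,A_5$ and reading the observed counts as $7, 18, 40, 31, 4$ (so $n=100$). The variable names are illustrative. For $4$ degrees of freedom the $\chi^2$ upper-tail probability has the closed form $e^{-q/2}(1+q/2)$.

```python
from math import comb, exp

observed = [7, 18, 40, 31, 4]     # counts of X = 0, 1, 2, 3, 4
n = sum(observed)                 # 100
alpha = 0.05

# Null cell probabilities p_i0 = P(X = x) under Bin(4, 1/2).
null_probs = [comb(4, x) * 0.5 ** 4 for x in range(5)]
expected = [n * p for p in null_probs]   # 6.25, 25, 37.5, 25, 6.25

q_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1 = 4

x = q_obs / 2.0
p_value = exp(-x) * (1.0 + x)     # exact chi-square upper tail when df = 4
reject = p_value <= alpha

print(expected)
print(q_obs, p_value, reject)     # q_obs is roughly 4.47
```

All expected counts are at least $5$, and $q_{obs}\approx 4.47$ is below the tabulated value $\chi^2_{.05}(4)\approx 9.49$, so $H_0$ is not rejected at the 5% level.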

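In many applications the distribution of $X$ is not completely specified: the hypothesized distribution of $X$ contains unknown parameters, which must be estimated under $H_0$.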

Null and alternative hypotheses

$H_0$: $X\sim F_\theta$ for some $\theta$ vs. $H_1: X \sim F\ne F_\theta$ for every $\theta$ $\rightarrow$
$H_0: (p_1,\dots,p_k) = (p_{10}(\theta),\dots,p_{k0}(\theta))$ vs. $H_1$: $p_i \ne p_{i0}(\theta)$ for at least one $i$, where $p_{i0}(\theta) = P_{X\sim F_{\theta}}(X \in A_i )$.

Test statistic:

$Q =\sum_{i=1}^k \frac{{\rm (Observed\,Count-Expected\,Count)^2}}{{\rm Expected\,Count}}= \sum_{i=1}^k \frac{(N_i - n p_{i0}(\hat{\theta}))^2}{n p_{i0}(\hat{\theta})}$

where $\hat{\theta}$ is an estimate of $\theta$ (typically the maximum likelihood estimate) computed under $H_0$ (i.e., assuming $X \sim F_\theta$).

Null distribution:
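$Q \underset{H_0}{\dot\sim} \chi^2(k-1-s)$
(where $s$ is the dimension of $\theta$, i.e., the number of parameters estimated under $H_0$).
The level-$\alpha$ rejection region in terms of the observed statistic $q_{obs}$ is $\{q_{obs}; q_{obs} \ge \chi^2_{\alpha}(k-1-s)\}$, where $q_{obs} =\sum_{i=1}^k \frac{(n_i - n p_{i0}(\hat{\theta}))^2}{n p_{i0}(\hat{\theta})}$ and $\chi^2_{\alpha}(k-1-s)$ is the upper-$\alpha$ quantile of the $\chi^2(k-1-s)$ distribution.
We reject $H_0$ if $q_{obs}$ is at least $\chi^2_\alpha(k-1-s)$; this gives an approximate level-$\alpha$ test.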
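The procedure with an estimated parameter can be sketched as follows. Everything here is an assumption made for illustration: the Poisson family is chosen as $F_\theta$ with $\theta=\lambda$ (so $s=1$), the frequency table is hypothetical, and the sample mean of the raw data is used as a convenient estimate of $\lambda$. The critical value $7.815$ is the standard tabulated $\chi^2_{.05}(3)$.

```python
from math import exp, factorial

# Hypothetical frequency table: value of X -> number of observations.
freq = {0: 18, 1: 32, 2: 27, 3: 15, 4: 6, 5: 2}
n = sum(freq.values())                               # 100

# Estimate lambda under H0 by the sample mean (ASSUMED estimator).
lam_hat = sum(value * count for value, count in freq.items()) / n

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

# Regions A_1..A_5: {0}, {1}, {2}, {3}, {4, 5, ...}; the tail is pooled.
observed = [freq[0], freq[1], freq[2], freq[3], freq[4] + freq[5]]
probs = [poisson_pmf(x, lam_hat) for x in range(4)]
probs.append(1.0 - sum(probs))                       # P(X >= 4)
expected = [n * p for p in probs]

q_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
s = 1                                                # one estimated parameter
df = len(observed) - 1 - s                           # k - 1 - s = 3
critical_value = 7.815                               # tabulated chi2_{.05}(3)
reject = q_obs >= critical_value

print(lam_hat, df, q_obs, reject)
```

Pooling the values $4$ and above into one region keeps every expected count above $5$, in line with the rule of thumb for the $\chi^2$ approximation.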

Example: $X$ $n = 50$ $50$ $32$ $12$ $6$ $2$ $X$ $\alpha = .05$ .

Chi-square tests of independence

Two-way contingency table (a table for two categorical variables)

Example:

A statistics 415 instructor wants to know if there is a relationship between favorite color (red or yellow) and the preferred condiment on a corn dog. The following table summarizes the results.

		Condiment
Color	Ketchup	Mustard	Total
Red
Yellow
Total

A general example of our contingency table with two classifying factors can be displayed as follows.

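Here $X_{ij}$ is the count in cell $(i,j)$, i.e., the number of observations in row category $i$ and column category $j$; $X_{i.}=X_{i1}+X_{i2}+...+X_{ic}$ is the total for row $i$; $X_{.j}=X_{1j}+X_{2j}+...+X_{rj}$ is the total for column $j$; and $X_{..}$ equals $n$, the sample size.
This is an $r\times c$ table: $r$ categories for the row variable and $c$ categories for the column variable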

	$B_1$	$B_2$	$\cdots$	$B_{c}$	Total
$A_1$	$X_{11}$	$X_{12}$	$\cdots$	$X_{1c}$	$X_{1.}$
$A_2$	$X_{21}$	$X_{22}$	$\cdots$	$X_{2c}$	$X_{2.}$
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$
$A_{r}$	$X_{r1}$	$X_{r2}$	$\cdots$	$X_{rc}$	$X_{r.}$
Total	$X_{.1}$	$X_{.2}$	$\cdots$	$X_{.c}$	$X_{..}=n$

The $rc$ counts $X_{ij}$, $i=1,...,r$, $j=1,...,c$, can be modeled using a multinomial distribution:

Each of the $n$ observations results in an outcome that can be classified by two attributes
- e.g., "Treatment/Control" on the rows, "Disease/No disease" on the columns
The first attribute has $r$ categories $A_1,...,A_{r}$ and the second has $c$ categories $B_1,...,B_c$
The probability of cell $(i,j)$ is $p_{ij}=P(A_i\cap B_j)$

Chi-square tests of independence

Chi-square tests of independence aim to answer the question: for a single observation, is the row assignment statistically independent of the column assignment?

In terms of hypotheses, we will test

$H_0:P(A_i\cap B_j)=P(A_i)P(B_j)$ for all $i=1,...,r$ and $j=1,...,c$

vs.

$H_1: P(A_{i}\cap B_j)\neq P(A_i)P(B_j)$ for at least one $i,j$ pair.

Defining

$p_{i.}=P(A_i),\, p_{.j}=P(B_j)$

we can rewrite $H_0$ and $H_1$ as

$H_0:p_{ij}=p_{i.}p_{.j}$ for all $i=1,...,r$ and $j=1,...,c$

$H_1:p_{ij}\neq p_{i.}p_{.j}$ for at least one $i,j$ pair
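To make the null hypothesis concrete, the sketch below checks whether a table of joint probabilities $p_{ij}$ factorizes into the product of its margins. Both probability tables and the helper names are hypothetical, chosen only to show one table that satisfies $H_0$ and one that does not.

```python
def margins(p):
    """Return (row margins p_i., column margins p_.j) of a joint table."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    return row, col

def satisfies_h0(p, tol=1e-12):
    """True if p_ij = p_i. * p_.j for every cell, up to rounding error."""
    row, col = margins(p)
    return all(
        abs(p[i][j] - row[i] * col[j]) <= tol
        for i in range(len(row))
        for j in range(len(col))
    )

# Built as products of margins (0.4, 0.6) and (0.3, 0.7): independent.
independent = [[0.12, 0.28], [0.18, 0.42]]

# Same kind of table, but p_11 = 0.30 while p_1. * p_.1 = 0.16: dependent.
dependent = [[0.30, 0.10], [0.10, 0.50]]

print(satisfies_h0(independent))  # True
print(satisfies_h0(dependent))    # False
```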

Test statistic:

$Q=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\{X_{ij}-n(X_{i.}/n)(X_{.j}/n)\}^2}{n(X_{i.}/n)(X_{.j}/n)}=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$

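where

- the observed count in cell $(i,j)$ is $O_{ij}=X_{ij}$
- the expected count in cell $(i,j)$ under $H_0$ is $E_{ij}=n(X_{i.}/n)(X_{.j}/n)$

Under $H_0:p_{ij}=p_{i.}p_{.j}$, the statistic $Q$ approximately follows a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom: $Q \underset{H_0}{\dot\sim} \chi^2((r-1)(c-1))$.

Number of free parameters under the null: $(r-1)+(c-1)$
Number of free parameters in the full parameter space: $rc-1$
Degrees of freedom: $(rc-1)-[(r-1)+(c-1)]=(r-1)(c-1)$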

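The level-$\alpha$ rejection region in terms of the observed statistic $q_{obs}$ is $\{q_{obs}; q_{obs} \ge \chi^2_{\alpha}((r-1)(c-1))\}$, where $\chi^2_{\alpha}((r-1)(c-1))$ is the upper-$\alpha$ quantile of the $\chi^2((r-1)(c-1))$ distribution.

A large discrepancy between observed and expected counts favors the alternative hypothesis.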
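The full test can be sketched for a general $r\times c$ table as follows. The $2\times 2$ counts are hypothetical and are not the corn dog data; `independence_test` is an illustrative helper name. The critical value $2.706$ is the standard tabulated $\chi^2_{.10}(1)$.

```python
def independence_test(table):
    """Return (Q, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E_ij = n * (X_i. / n) * (X_.j / n) = X_i. * X_.j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    q = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return q, df, expected

# Hypothetical 2 x 2 table of counts (rows: A_1, A_2; columns: B_1, B_2).
table = [[30, 20], [15, 35]]

q_obs, df, expected = independence_test(table)
critical_value = 2.706            # tabulated chi2_{.10}(1)
reject = q_obs >= critical_value

print(expected)                   # [[22.5, 27.5], [22.5, 27.5]]
print(q_obs, df, reject)          # q_obs = 100/11, roughly 9.09
```

Here the observed counts differ from the expected counts by $7.5$ in every cell, which is large relative to the expected counts, so independence is rejected at the 10% level for this made-up table.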

Example:

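Using the corn dog data, test whether favorite color and preferred condiment are independent at level $\alpha=10\%$.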

		Condiment
Color	Ketchup	Mustard	Total
Red
Yellow
Total